INTERSPEECH 2015 - Speech Recognition

Total: 79

#1 Mispronunciation detection without nonnative training data

Authors: Ann Lee ; James Glass

Conventional mispronunciation detection systems that have the capability of providing corrective feedback typically require a set of common error patterns that are known beforehand, obtained either by consulting with experts, or from a human-annotated nonnative corpus. In this paper, we propose a mispronunciation detection framework that does not rely on nonnative training data. We first discover an individual learner's possible pronunciation error patterns by analyzing the acoustic similarities across their utterances. With the discovered error candidates, we iteratively compute forced alignments and decode learner-specific context-dependent error patterns in a greedy manner. We evaluate the framework on a Chinese University of Hong Kong (CUHK) corpus containing both Cantonese and Mandarin speakers reading English. Experimental results show that the proposed framework effectively detects mispronunciations and also has a good ability to prioritize feedback.

#2 Automatic accentedness evaluation of non-native speech using phonetic and sub-phonetic posterior probabilities

Authors: Ramya Rasipuram ; Milos Cernak ; Alexandre Nanchen ; Mathew Magimai-Doss

Automatic evaluation of non-native speech accentedness has potential implications not only for language learning and accent identification systems but also for speaker and speech recognition systems. From the perspective of speech production, the two primary factors influencing accentedness are phonetic and prosodic structure. In this paper, we propose an approach for automatic accentedness evaluation based on comparing utterances of native and non-native speakers at the acoustic-phonetic level. Specifically, the proposed approach measures accentedness by comparing phone class conditional probability sequences corresponding to the utterances of native and non-native speakers, respectively. We evaluate the proposed approach on the EMIME bilingual and EMIME Mandarin bilingual corpora, which contain English speech from native English speakers and from non-native English speakers whose native languages are Finnish, German, and Mandarin. We also investigate the influence of the granularity of the phonetic unit representation on the performance of the proposed accentedness measure. Our results indicate that the accentedness ratings produced by the proposed approach correlate consistently with human ratings of accentedness. In addition, our studies show that the granularity of the phonetic unit representation that yields the best correlation with the human accentedness ratings varies with the native language of the non-native speakers.
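
The abstract does not spell out the comparison measure, but the general recipe it describes can be sketched as follows: align the phone class posterior sequence of a non-native utterance to that of a native reference, and average a per-frame divergence along the alignment path. The use of symmetric KL divergence and plain DTW below is our assumption for illustration, not the paper's exact method.

```python
import numpy as np

def sym_kl(p, q, eps=1e-10):
    """Symmetric KL divergence between two phone-posterior vectors."""
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def accentedness_score(native, nonnative):
    """DTW-align two (frames x phone-classes) posterior sequences and
    return the mean per-frame divergence along the optimal path."""
    n, m = len(native), len(nonnative)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sym_kl(native[i - 1], nonnative[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Normalise by an upper bound on the path length for comparability.
    return cost[n, m] / (n + m)

# Toy example: 3 phone classes, random posteriors for two utterances.
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(3), size=50)
b = rng.dirichlet(np.ones(3), size=60)
print(accentedness_score(a, b))
```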

#3 Using F0 contours to assess nativeness in a sentence repeat task

Authors: Min Ma ; Keelan Evanini ; Anastassia Loukina ; Xinhao Wang ; Klaus Zechner

In this paper, we conduct experiments using F0 contour features to assess the nativeness of responses provided by speakers from India and China to a Sentence Repeat task in an assessment of English speaking proficiency for non-native speakers. The results show that the coefficients of polynomial models of the pitch contours help distinguish between native and non-native speakers, especially among female speakers. We find that the F0 contour can be represented adequately using only basic statistical variables and the first three orders of polynomial coefficients. In addition, we present the most important classification features for each group of speakers. Finally, we discuss the differences among the gender-specific speaker groups.
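
As a rough illustration of such features, one can fit a low-order polynomial to the voiced portion of an F0 contour and keep the leading coefficients together with basic statistics. The exact feature set, time normalization, and voicing handling below are assumptions, not the authors' recipe.

```python
import numpy as np

def f0_contour_features(f0, order=3):
    """Fit a polynomial to a voiced F0 contour and return the first
    `order` polynomial coefficients plus basic statistics.
    Unvoiced frames are assumed to be marked with 0 and are dropped."""
    voiced = f0[f0 > 0]
    t = np.linspace(0.0, 1.0, len(voiced))   # time normalised to [0, 1]
    coeffs = np.polynomial.polynomial.polyfit(t, voiced, deg=order)
    stats = [voiced.mean(), voiced.std(), voiced.min(), voiced.max()]
    return np.concatenate([coeffs[1:order + 1], stats])  # skip the offset term

# Toy contour: a rising-falling pitch pattern with unvoiced gaps.
f0 = np.concatenate([np.zeros(5),
                     120 + 40 * np.sin(np.linspace(0, np.pi, 80)),
                     np.zeros(5)])
print(f0_contour_features(f0))
```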

#4 Using linguistic indicators of difficulty to identify mild cognitive impairment

Authors: Rebecca Lunsford ; Peter A. Heeman

Speaking is a complex task, and it is to be expected that speech will be affected when a speaker is faced with cognitive difficulties. To explore how cognitive impairment is manifested in a person's speech, we compared the speech of elders diagnosed with Mild Cognitive Impairment (MCI) to that of others who are cognitively intact, while the speakers attempted to retell a story they had just heard. We found that the speakers with impairment, compared to those who are cognitively intact, spent more time engaged in verbalized hesitations (e.g., “and um …”) before speaking story content, and that these verbalized hesitations accounted for a larger proportion of the time spent retelling. In addition, we found that a higher percentage of the impaired speakers used phrases such as “I guess” and “I can't recall” to qualify content they were unsure of, or to replace details they could not recall. These results provide insight into how speakers manage cognitive impairment, suggesting that these indicators of difficulty could be used to assist in early diagnosis of MCI.

#5 Automatic intelligibility measures applied to speech signals simulating age-related hearing loss

Authors: Lionel Fontan ; Jérôme Farinas ; Isabelle Ferrané ; Julien Pinquier ; Xavier Aumont

This research forms the first part of a long-term project designed to provide a framework for facilitating hearing-aid tuning. The present study focuses on setting up automatic measures of speech intelligibility for the recognition of isolated words and sentences. Both materials were degraded in order to simulate the effects of presbycusis on speech perception. Automatic measures based on an Automatic Speech Recognition (ASR) system were applied to an audio corpus simulating the effects of presbycusis at nine severity stages. The results are compared to reference intelligibility scores collected from 60 French listeners. Since the aim of the system is to produce measures as close as possible to human behaviour, the strong correlations observed between subjective and objective scores indicate good performance.
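
The core validation step, comparing objective and subjective scores, amounts to computing correlations. A minimal sketch with purely made-up numbers (not the paper's results) might look like this:

```python
import numpy as np
from scipy import stats

# Hypothetical data: ASR word accuracy per presbycusis severity stage (9
# stages) and the mean intelligibility score of the listeners at each stage.
asr_accuracy = np.array([0.95, 0.93, 0.90, 0.84, 0.75, 0.61, 0.45, 0.28, 0.12])
human_scores = np.array([0.97, 0.95, 0.91, 0.86, 0.78, 0.64, 0.47, 0.30, 0.15])

r, p = stats.pearsonr(asr_accuracy, human_scores)       # linear correlation
rho, p_s = stats.spearmanr(asr_accuracy, human_scores)  # rank correlation
print(f"Pearson r = {r:.3f} (p = {p:.1e}), Spearman rho = {rho:.3f}")
```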

#6 Assessing empathy using static and dynamic behavior models based on therapist's language in addiction counseling

Authors: Sandeep Nallan Chakravarthula ; Bo Xiao ; Zac E. Imel ; David C. Atkins ; Panayiotis G. Georgiou

Empathy by the counselor is an important measure of treatment quality in psychotherapy. It is a behavioral process that involves understanding and sharing the experiences and emotions of a person over the course of an interaction. While a complex phenomenon, human behavior can at moments be perceived as strongly empathetic or non-empathetic. Currently, manual coding of behavior and behavioral signal processing models of empathy often make the unnatural assumption that empathy is constant throughout the interaction. In this work we investigate two models: a Static Behavior Model (SBM), which assumes a fixed degree of empathy throughout an interaction, and a context-dependent Dynamic Behavior Model (DBM), which assumes a Hidden Markov Model allowing transitions between high- and low-empathy states. Through non-causal human perception mechanisms, these states can be perceived and integrated as high or low gestalt empathy. We show that the DBM performs better than the SBM while, as a byproduct, generating local labels that may be of use to domain experts. We also demonstrate the robustness of both the SBM and the DBM to transcription errors by using ASR output rather than human transcriptions. Our results suggest that empathy manifests itself in different forms over time and is best captured by context-dependent models.
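
A minimal sketch of the dynamic-behavior idea is a two-state HMM decoded with the Viterbi algorithm; the transition probabilities and toy observation likelihoods below are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

# 2-state HMM (0 = low empathy, 1 = high empathy) decoded over a sequence
# of per-talk-turn observation log-likelihoods (all parameters assumed).
log_trans = np.log(np.array([[0.9, 0.1],    # sticky states: the empathy
                             [0.1, 0.9]]))  # level tends to persist
log_init = np.log(np.array([0.5, 0.5]))

def viterbi(log_obs):
    """log_obs: (T x 2) log-likelihood of each turn under each state."""
    T = len(log_obs)
    delta = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]   # local high/low empathy labels per turn

# Toy observations for 6 turns (likelihoods under low/high states).
obs = np.log(np.array([[0.8, 0.2], [0.7, 0.3], [0.3, 0.7],
                       [0.2, 0.8], [0.25, 0.75], [0.6, 0.4]]))
print(viterbi(obs))
```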

#7 SVitchboard II and fiSVer i: high-quality limited-complexity corpora of conversational English speech

Authors: Yuzong Liu ; Rishabh Iyer ; Katrin Kirchhoff ; Jeff Bilmes

In this paper, we introduce a set of benchmark corpora of conversational English speech derived from the Switchboard-I and Fisher datasets. Traditional ASR research requires considerable computational resources and has slow experimental turnaround times. Our goal is to introduce these new datasets to researchers in the ASR and machine learning communities (especially in academia), in order to facilitate the development of novel acoustic modeling techniques on smaller but acoustically rich corpora. We select these corpora to maximize an acoustic quality criterion while limiting the vocabulary size (from 10 words up to 10,000 words) with different state-of-the-art submodular function optimization algorithms. We provide baseline word recognition results for both GMM and DNN-based systems and release the corpora definitions and Kaldi training recipes to the public.

#8 Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model

Authors: Herman Kamper ; Aren Jansen ; Sharon Goldwater

Current supervised speech technology relies heavily on transcribed speech and pronunciation dictionaries. In settings where unlabelled speech data alone is available, unsupervised methods are required to discover categorical linguistic structure directly from the audio. We present a novel Bayesian model which segments unlabelled input speech into word-like units, resulting in a complete unsupervised transcription of the speech in terms of discovered word types. In our approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional space; the model (implemented as a Gibbs sampler) then builds a whole-word acoustic model in this space while jointly doing segmentation. We report word error rates in a connected digit recognition task by mapping the unsupervised output to ground truth transcriptions. Our model outperforms a previously developed HMM-based system, even when the model is not constrained to discover only the 11 word types present in the data.
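
The evaluation step, "mapping the unsupervised output to ground truth transcriptions", can be sketched as a majority-vote cluster-to-word assignment; the one-to-one token alignment assumed below is a simplification for illustration, not necessarily the authors' scoring procedure.

```python
from collections import Counter

# Discovered word types (cluster IDs) and the aligned reference words.
hyp_clusters = [3, 3, 7, 1, 7, 7]
ref_words = ["one", "one", "two", "oh", "two", "three"]

# Assign each cluster the ground-truth word it most often overlaps ...
votes = {}
for c, w in zip(hyp_clusters, ref_words):
    votes.setdefault(c, Counter())[w] += 1
mapping = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}

# ... then score the mapped output against the reference as usual.
mapped = [mapping[c] for c in hyp_clusters]
errors = sum(m != r for m, r in zip(mapped, ref_words))
print(mapping, f"errors: {errors}/{len(ref_words)}")
```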

#9 LSTM for punctuation restoration in speech transcripts

Authors: Ottokar Tilk ; Tanel Alumäe

The output of automatic speech recognition systems is generally an unpunctuated stream of words that is hard to process for both humans and machines. We present a two-stage recurrent neural network based model using long short-term memory units to restore punctuation in speech transcripts. In the first stage, textual features are learned on a large text corpus. The second stage combines the textual features with pause durations and adapts the model to the speech domain. Our approach reduces the number of punctuation errors by up to 16.9% compared to a decision tree that combines hidden-event language model posteriors with inter-word pause information, with the largest improvements in period restoration.
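
A minimal sketch of such a tagger in PyTorch might look like the following, where the vocabulary size, layer sizes, punctuation inventory, and the use of zeroed pause inputs in stage one are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class PunctuationLSTM(nn.Module):
    """Two-stage punctuation tagger sketch (all sizes assumed).
    Stage 1: train on text alone (pause input fed as zeros).
    Stage 2: fine-tune on speech transcripts with real pause durations."""
    def __init__(self, vocab=10000, emb=100, hidden=256, n_punct=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb + 1, hidden, batch_first=True)  # +1 = pause
        self.out = nn.Linear(hidden, n_punct)  # e.g. none/comma/period/question

    def forward(self, words, pauses):
        x = torch.cat([self.embed(words), pauses.unsqueeze(-1)], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)   # logits for the punctuation slot after each word

model = PunctuationLSTM()
words = torch.randint(0, 10000, (2, 12))   # batch of 2 sequences, 12 words each
pauses = torch.zeros(2, 12)                # stage 1: no pause features yet
logits = model(words, pauses)
print(logits.shape)                        # (2, 12, 4)
```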

#10 Noise robust exemplar matching for speech enhancement: applications to automatic speech recognition

Authors: Emre Yılmaz ; Deepak Baby ; Hugo Van hamme

We present a novel automatic speech recognition (ASR) scheme which uses the recently proposed noise robust exemplar matching framework for speech enhancement in the front-end. Unlike prior work that focused on template matching only, the proposed system employs a GMM-HMM back-end to recognize the enhanced speech signals. Speech enhancement is achieved using multiple dictionaries, each containing speech exemplars representing a single speech unit together with several noise exemplars of the same length. These combined dictionaries are used to approximate the noisy segments, and the speech component is obtained as the linear combination of the speech exemplars in the combined dictionaries that yields the minimum total reconstruction error. The performance of the proposed system is evaluated on the small vocabulary track of the 2nd CHiME Challenge and the AURORA-2 database, and the results show the effectiveness of the proposed approach in improving the noise robustness of a conventional ASR system.
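
The enhancement idea, approximating a noisy segment over a combined speech-plus-noise dictionary and keeping only the speech part, can be sketched with a non-negative least-squares solve. The exemplar sizes are assumptions, and NNLS stands in for whatever sparsity-inducing solver the framework actually uses.

```python
import numpy as np
from scipy.optimize import nnls

# Toy combined dictionary: columns are exemplars (assumed sizes).
rng = np.random.default_rng(1)
speech_ex = np.abs(rng.normal(size=(100, 8)))   # 8 speech exemplars
noise_ex = np.abs(rng.normal(size=(100, 4)))    # 4 noise exemplars
D = np.hstack([speech_ex, noise_ex])

noisy = np.abs(rng.normal(size=100))            # one noisy segment (magnitudes)
weights, _ = nnls(D, noisy)                     # non-negative reconstruction

# Enhanced speech = the speech-exemplar part of the approximation.
speech_part = speech_ex @ weights[:8]
print(speech_part[:5].round(3))
```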

#11 A study on robust detection of pronunciation erroneous tendency based on deep neural network

Authors: Yingming Gao ; Yanlu Xie ; Wen Cao ; Jinsong Zhang

Language learners using computer-aided pronunciation training (CAPT) systems demand instructive feedback rather than mere scores, which requires detailed information about erroneous pronunciations in addition to the phone errors themselves. Pronunciation erroneous tendency (PET) defines a set of incorrect articulation configurations, in terms of the main articulators and manners of articulation, for each phone, and its robust detection contributes to the provision of appropriate instructive feedback. In our previous work, we designed a set of PET labels for CSL (Chinese as a second language) learning by Japanese learners, and conducted a preliminary detection study with GMM-HMM. This study aims at achieving more robust detection of PETs via two approaches: employing DNN-HMM for acoustic modeling, and comparing three kinds of acoustic features: MFCC, PLP, and filter-bank. Experimental results showed that the DNN-HMM PET modeling achieved more robust detection accuracies than the previous GMM-HMM, and that the three kinds of features behaved differently. A lattice combination of the results of the three feature systems led to the best PET results: an FRR of 5.5%, an FAR of 35.6%, and a DA of 88.6%, demonstrating its effectiveness.

#12 Vowel mispronunciation detection using DNN acoustic models with cross-lingual training

Authors: Shrikant Joshi ; Nachiket Deo ; Preeti Rao

We address the automatic detection of phone-level mispronunciation for feedback in a computer-aided language learning task where the target language data (Indian English) is limited. Based on the recent success of DNN acoustic models on limited-resource recognition tasks, we compare different methods of utilizing the limited target language data in the training of acoustic models that are initialized with multilingual data. Frame-level DNN posteriors obtained by the different training methods are compared in a phone classification task with a baseline GMM/HMM system. A judicious use of domain knowledge in terms of L2 phonology and L1 interference, including its influence on phone quality and duration, is applied to the design of confidence scores for mispronunciation detection of vowels of Indian English as spoken by Gujarati L1 learners. We also show that the pronunciation error detection system benefits from a more precise signal-based segmentation of the test speech vowels, as would be expected from the now more reliable frame-based confidence scores.

#13 Confidence-features and confidence-scores for ASR applications in arbitration and DNN speaker adaptation

Authors: Kshitiz Kumar ; Ziad Al Bawab ; Yong Zhao ; Chaojun Liu ; Benoit Dumoulin ; Yifan Gong

Speech recognition confidence-scores quantitatively represent the correctness of decoded utterances in a [0,1] range. Confidences have primarily been used to filter out recognitions with scores below a threshold. They have also been used in other speech applications, e.g., arbitration, ROVER, and high-quality data selection for model training. Confidence-scores are computed from a rich set of confidence-features in the speech recognition engine. While many speech applications consume confidence-scores, there has been little focus on directly consuming confidence-features in applications. In this work we argue that additionally consuming confidence-features can provide large gains across confidence-related tasks. We demonstrate this for an arbitration application, where we obtain a 31% relative reduction in the arbitration metric. We additionally demonstrate a novel application of confidence-scores to deep-neural-network (DNN) adaptation, where we obtain a large relative reduction in word-error-rate (WER) for speaker adaptation on limited data.
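
The difference between consuming a scalar confidence-score and consuming the underlying confidence-features can be sketched as training a small arbitration classifier directly on the feature vector. Everything below (the feature set, the classifier, the synthetic data) is a generic stand-in, not the paper's system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical per-utterance confidence-features from two recognizers
# (e.g. acoustic score, LM score, lattice density ... all assumed).
X = rng.normal(size=(500, 8))
# Synthetic arbitration labels: 1 = pick recognizer A, 0 = pick B.
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

clf = LogisticRegression().fit(X, y)   # arbitration over raw features
print(clf.score(X, y))                 # training accuracy of the arbiter
```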

#14 Topic modeling for conference analytics

Authors: Pengfei Liu ; Shoaib Jameel ; Wai Lam ; Bin Ma ; Helen Meng

This work presents our attempt to understand the research topics that characterize the papers submitted to a conference, using topic modeling and data visualization techniques. We infer the latent topics from the abstracts of all the papers submitted to Interspeech 2014 by means of Latent Dirichlet Allocation. The per-topic word distributions thus obtained are visualized through word clouds. We also compare the automatically inferred topics against the expert-defined topics (also known as tracks for Interspeech 2014). The comparison is based on an information retrieval framework, where we use each latent topic as a query and each track as a document. For each latent topic, we retrieve a ranked list of tracks scored by the degree of word overlap, and associate the latent topic with the top-scoring track. Applied to all submissions to Interspeech 2014, this analytic procedure provides an overview of topic categorization in the conference, of popular versus unpopular topics, of emerging topics, and of topic compositions. Such insights are potentially valuable for understanding the technical content of a field and planning the future development of its conference(s).
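
A toy version of this pipeline is easy to sketch with scikit-learn; the documents, track definitions, and overlap scoring below are made-up stand-ins, not the authors' data or exact retrieval model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical tiny corpus standing in for submitted abstracts.
abstracts = [
    "deep neural network acoustic model training",
    "speaker verification with i-vectors and PLDA",
    "prosody and intonation in spontaneous speech",
    "noise robust feature extraction for recognition",
]
tracks = {"acoustic modeling": {"neural", "acoustic", "model", "training"},
          "speaker recognition": {"speaker", "verification", "plda"}}

vec = CountVectorizer()
X = vec.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vec.get_feature_names_out()

for k, topic in enumerate(lda.components_):
    top_words = {vocab[i] for i in topic.argsort()[-5:]}   # topic as a "query"
    # Score each expert track by word overlap and keep the best match.
    best = max(tracks, key=lambda t: len(tracks[t] & top_words))
    print(f"topic {k}: {sorted(top_words)} -> {best}")
```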

#15 Sparse coding based features for speech units classification

Authors: Pulkit Sharma ; Vinayak Abrol ; A. D. Dileep ; Anil Kumar Sao

In this paper a sparse representation based feature is proposed for speech recognition tasks. The dictionary plays an important role in obtaining a good sparse representation; therefore, instead of using a single overcomplete dictionary, multiple signal-adaptive dictionaries are used. A novel principal component analysis (PCA) based method is proposed to learn multiple dictionaries for each speech unit. For a given speech frame, a minimum-distance criterion is first employed to select the appropriate dictionary, and a sparse solver is then used to compute the sparse feature for acoustic modeling. Experiments performed on different datasets show that the proposed feature outperforms existing features in the recognition of isolated utterances.
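
A minimal sketch of this pipeline follows, with assumed dimensions, one dictionary per unit (the paper learns multiple per unit), and distance-to-dictionary-mean standing in for the unspecified minimum-distance criterion.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
dim, n_units = 39, 3                          # e.g. 39-d frames (assumed)

# One PCA-based dictionary per speech unit, learned from that unit's frames.
dictionaries = []
for u in range(n_units):
    frames = rng.normal(size=(200, dim)) + u  # stand-in training frames
    pca = PCA(n_components=20).fit(frames)
    dictionaries.append((pca.mean_, pca.components_.T))  # (dim x atoms)

def sparse_feature(frame, n_nonzero=5):
    # Select the dictionary whose mean is closest to the frame ...
    dists = [np.linalg.norm(frame - mu) for mu, _ in dictionaries]
    mu, D = dictionaries[int(np.argmin(dists))]
    # ... then solve for a sparse code of the (centred) frame over it.
    code = orthogonal_mp(D, frame - mu, n_nonzero_coefs=n_nonzero)
    return code                               # used as the acoustic feature

print(sparse_feature(rng.normal(size=dim) + 1).round(2))
```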

#16 Bayesian integration of sound source separation and speech recognition: a new approach to simultaneous speech recognition

Authors: Kousuke Itakura ; Izaya Nishimuta ; Yoshiaki Bando ; Katsutoshi Itoyama ; Kazuyoshi Yoshii

This paper presents a novel Bayesian method that can directly recognize overlapping utterances without explicitly separating mixture signals into their independent components in advance of speech recognition. The conventional approach to contaminated speech recognition in real environments uniquely extracts the clean isolated signals of individual sources (e.g., by noise reduction, dereverberation, and source separation). One of the main limitations of this cascading approach is that the accuracy of speech recognition is upper bounded by the accuracy of preprocessing. To overcome this limitation, our method marginalizes out uncertain isolated speech signals by integrating source separation and speech recognition in a Bayesian manner. A sufficient number of samples are drawn from the posterior distribution of isolated speech signals by using a Markov chain Monte Carlo method, and the posterior distributions of uttered texts for those samples are then integrated. Under a certain condition, this Monte Carlo integration is shown to reduce to the well-known ROVER method, which integrates recognized texts obtained from sampled speech signals. Results of simultaneous speech recognition experiments showed that, in terms of word accuracy, the proposed method significantly outperformed conventional cascading methods.
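
Concretely, the marginalization the abstract describes can be written as a standard Monte Carlo approximation (our notation, not necessarily the paper's):

```latex
p(w \mid x) \;=\; \int p(w \mid s)\, p(s \mid x)\, \mathrm{d}s
\;\approx\; \frac{1}{N} \sum_{n=1}^{N} p\bigl(w \mid s^{(n)}\bigr),
\qquad s^{(n)} \sim p(s \mid x),
```

where x is the observed mixture, s an isolated speech signal, and w an uttered text. The s^{(n)} are the MCMC samples of separated speech, and combining the N per-sample recognition hypotheses by voting is what reduces, under the condition mentioned in the abstract, to a ROVER-style integration.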

#17 Channel selection in the short-time modulation domain for distant speech recognition

Authors: Ivan Himawan ; Petr Motlicek ; Sridha Sridharan ; David Dean ; Dian Tjondronegoro

Automatic speech recognition from multiple distant microphones poses significant challenges because of noise and reverberation. The quality of speech acquisition may vary between microphones because of speaker movements and channel distortions. This paper proposes a channel selection approach for selecting reliable channels based on a selection criterion operating in the short-time modulation spectrum domain. The proposed approach quantifies the relative strength of speech modulations obtained from each microphone and from beamformed speech. The new technique is evaluated experimentally in real reverberant conditions in terms of perceptual evaluation of speech quality (PESQ) measures and word error rate (WER). An overall improvement in recognition rate is observed using delay-sum and superdirective beamformers, compared to the case where a channel of the circular microphone array is selected randomly.
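
The selection idea can be sketched as scoring each channel by the strength of speech-like envelope modulations and keeping the strongest one. The 2-8 Hz band and the crude broadband envelope below are assumptions for illustration, not the paper's criterion.

```python
import numpy as np
from scipy.signal import stft

def modulation_strength(x, sr=16000):
    """Crude proxy for speech modulation strength: energy of the 2-8 Hz
    modulations of the broadband spectral envelope (assumed criterion)."""
    f, t, Z = stft(x, fs=sr, nperseg=400, noverlap=240)  # 25 ms / 10 ms hop
    env = np.abs(Z).mean(axis=0)                         # broadband envelope
    frame_rate = sr / 160.0                              # frames per second
    spec = np.abs(np.fft.rfft(env - env.mean()))
    mf = np.fft.rfftfreq(len(env), d=1.0 / frame_rate)
    return spec[(mf >= 2) & (mf <= 8)].sum()

# Pick the channel with the strongest speech-like modulations.
channels = [np.random.randn(16000) for _ in range(4)]   # toy 1 s recordings
best = int(np.argmax([modulation_strength(c) for c in channels]))
print("selected channel:", best)
```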

#18 A multi-channel speech enhancement framework for robust NMF-based speech recognition for speech-impaired users

Authors: Gert Dekkers ; Toon van Waterschoot ; Bart Vanrumste ; Bert Van Den Broeck ; Jort F. Gemmeke ; Hugo Van hamme ; Peter Karsmakers

In this paper a multi-channel speech enhancement framework for distant speech acquisition in noisy and reverberant environments for Non-negative Matrix Factorization (NMF)-based Automatic Speech Recognition (ASR) is proposed. The system is evaluated for its use in an assistive vocal interface for physically impaired and speech-impaired users. The framework utilises the Spatially Pre-processed Speech Distortion Weighted Multi-channel Wiener Filter (SP-SDW-MWF) in combination with a post-filter to reduce noise and reverberation. Additionally, the estimation uncertainty of the speech enhancement framework is propagated through the Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction to allow for feature compensation at a later stage. Results indicate that (a) using a trade-off parameter between noise reduction and speech distortion has a positive effect on the recognition performance with respect to the well-known GSC and MWF, and (b) the addition of the post-filter and the feature compensation increases performance with respect to several baselines for both a non-pathological and a pathological speaker.

#19 Sound source separation algorithm using phase difference and angle distribution modeling near the target

Authors: Chanwoo Kim ; Kean K. Chin

In this paper we present a novel two-microphone sound source separation algorithm which selects the signal from the target direction while suppressing signals from other directions. In this algorithm, referred to as Power Angle Information Near Target (PAINT), we first calculate the phase difference for each time-frequency bin. From the phase difference, the angle of a sound source is estimated. For each frame, we represent the source angle distribution near the expected target location as a mixture of a Gaussian and a uniform distribution, and obtain binary masks using hypothesis testing. Continuous masks are calculated from the binary masks using the Channel Weighting (CW) technique, and the processed speech is synthesized using the IFFT and the OverLap-Add (OLA) method. We demonstrate that the algorithm described in this paper achieves better speech recognition accuracy than conventional approaches and our previous approaches.
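
A stripped-down sketch of the masking step follows, ignoring phase wrapping, the power term, channel weighting, and resynthesis; the microphone spacing and all distribution parameters are assumptions, not the paper's values.

```python
import numpy as np

SR, D, C = 16000, 0.04, 343.0   # sample rate, mic spacing (m), speed of sound

def angle_mask(X1, X2, freqs, target_deg=0.0, sigma_deg=10.0):
    """X1, X2: STFTs (freq x frames) of the two channels."""
    phase_diff = np.angle(X1 * np.conj(X2))
    # Phase difference -> time delay -> direction of arrival per bin.
    delay = phase_diff / (2 * np.pi * np.maximum(freqs[:, None], 1.0))
    sin_theta = np.clip(delay * C / D, -1.0, 1.0)
    theta = np.degrees(np.arcsin(sin_theta))
    # Hypothesis test: Gaussian around the target direction vs a uniform
    # distribution over 180 degrees for "anything else".
    gauss = np.exp(-0.5 * ((theta - target_deg) / sigma_deg) ** 2) \
        / (sigma_deg * np.sqrt(2 * np.pi))
    uniform = 1.0 / 180.0
    return (gauss > uniform).astype(float)   # binary mask, applied to X1

freqs = np.linspace(0, SR / 2, 257)
rng = np.random.default_rng(0)
X1 = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
X2 = X1 * np.exp(-2j * np.pi * freqs[:, None] * 1e-4)  # toy delayed copy
print(angle_mask(X1, X2, freqs).mean())                # fraction of bins kept
```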

#20 Contaminated speech training methods for robust DNN-HMM distant speech recognition

Authors: Mirco Ravanelli ; Maurizio Omologo

Despite the significant progress made in recent years, state-of-the-art speech recognition technologies provide satisfactory performance only in the close-talking condition. Robustness of distant speech recognition in adverse acoustic conditions, on the other hand, remains a crucial open issue for future applications of human-machine interaction. To this end, several advances in speech enhancement, acoustic scene analysis, and acoustic modeling have recently contributed to improving the state of the art in the field. One of the most effective approaches to deriving robust acoustic models is based on using contaminated speech, which has proved helpful in reducing the acoustic mismatch between training and testing conditions. In this paper, we revisit this classical approach in the context of modern DNN-HMM systems and propose the adoption of three methods, namely asymmetric context windowing, close-talk based supervision, and close-talk based pre-training. The experimental results, obtained using both real and simulated data, show a significant advantage in using these three methods, which overall provide a 15% error rate reduction compared to the baseline systems. The same performance trend is confirmed with both a small high-quality training set and a large one.

#21 Distance-aware DNNs for robust speech recognition

Authors: Yajie Miao ; Florian Metze

Distant speech recognition (DSR) remains an open challenge, even for state-of-the-art deep neural network (DNN) models. Previous work has attempted to improve DNNs for speech recorded at a constant distance. However, in real applications, the speaker-microphone distance (SMD) can be quite dynamic, varying even within a single utterance. This paper investigates how to alleviate the impact of dynamic SMD on DNN models. Our solution is to incorporate frame-level SMD information into DNN training. Generation of the SMD information relies on a universal extractor that is learned on a meeting corpus. We study the utility of different architectures for instantiating the SMD extractor. On our target acoustic modeling task, two approaches are proposed to build distance-aware DNN models using the SMD information: simple concatenation and distance adaptive training (DAT). Our experiments show that even in the simplest case, incorporating the SMD descriptors improves the word error rates of DNNs by 5.6% relative. Further optimizing SMD extraction and integration results in additional gains.
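
The simple-concatenation variant is straightforward to sketch; the feature dimensionality and the scalar SMD descriptor below are assumptions (the paper's extractor output may be a vector).

```python
import numpy as np

def distance_aware_features(feats, smd):
    """Simple-concatenation variant sketched from the abstract: append the
    frame-level SMD descriptor to each acoustic feature frame before it is
    fed to the DNN."""
    return np.hstack([feats, smd.reshape(len(feats), -1)])

feats = np.random.randn(300, 40)   # 300 frames of 40-d filterbanks (assumed)
smd = np.random.rand(300)          # per-frame distance descriptor (assumed)
print(distance_aware_features(feats, smd).shape)   # (300, 41)
```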

#22 Under-resourced speech recognition based on the speech manifold

Authors: Reza Sahraeian ; Dirk Van Compernolle ; Febe de Wet

Conventional acoustic modeling involves estimating many parameters to effectively model feature distributions. The sparseness of speech and text data, however, degrades the reliability of the estimation process and makes speech recognition a challenging task. In this paper, we propose to use a nonlinear feature transformation based on the speech manifold, called Intrinsic Spectral Analysis (ISA), for under-resourced speech recognition. First, we investigate the usefulness of ISA features in low-resource scenarios for both Gaussian mixture and deep neural network (DNN) acoustic modeling. Moreover, because of the connection of ISA features to the articulatory configuration space, this feature space is potentially less language-dependent than other typical spectral-based features, and exploiting out-of-language data in this feature space is therefore beneficial. We demonstrate the positive effect of ISA in the framework of multilingual DNN systems where Flemish and Afrikaans are used as the donor and under-resourced target languages, respectively. We compare the performance of ISA with conventional features in both multilingual and under-resourced monolingual conditions.

#23 Multilingual features based keyword search for very low-resource languages

Authors: Pavel Golik ; Zoltán Tüske ; Ralf Schlüter ; Hermann Ney

In this paper we describe RWTH Aachen's system for keyword search (KWS) with a very limited amount of transcribed audio data available in the target language. This setting has become this year's primary condition within the Babel project [1], which seeks to minimize the amount of human effort while retaining reasonable KWS performance. The highlights presented in this paper include graphemic acoustic modeling; multilingual features trained on language data from the previous project periods; a comparison of tandem and hybrid DNN-HMM acoustic models; processing of large amounts of text data available on the web; and morphological KWS based on automatically derived word fragments. The evaluation is performed using two training sets for each of the six current project period's languages: the full language pack (FLP), consisting of 30 hours, and the very limited language pack (VLLP), comprising less than 3 hours of transcribed audio data. We put our focus on the latter of the two, which is clearly more challenging. The methods described in this work allowed us to exceed 0.3 MTWV on five out of six languages using development queries.

#24 Second language speech recognition using multiple-pass decoding with lexicon represented by multiple reduced phoneme sets

Authors: Xiaoyun Wang ; Seiichi Yamamoto

Considering that the pronunciation of second language speech is usually influenced by the mother tongue, we previously proposed using a reduced phoneme set for second language speech recognition when the mother tongue of the speakers is known. However, the proficiency of second language speakers varies widely, as does the influence of the mother tongue on their pronunciation. Consequently, the optimal phoneme set depends on the proficiency of the second language speaker. In this work, we examine the relation between the proficiency of speakers and a reduced phoneme set customized for them. Based on experimental results for speech recognition of second language speakers with various proficiencies, we propose a novel speech recognition method that performs multiple-pass decoding using a lexicon represented by multiple reduced phoneme sets. The relative error reduction obtained with the multiple reduced phoneme sets is 26.8% compared with the canonical phoneme set.
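
The notion of a reduced phoneme set can be illustrated as a many-to-one phoneme mapping applied to the canonical lexicon; the mergers below are purely hypothetical examples, not the paper's actual set.

```python
# Illustrative reduced phoneme set: a many-to-one mapping that merges
# contrasts assumed hard for a given proficiency level (hypothetical).
reduced_map = {"L": "R", "V": "B", "TH": "S"}   # assumed mergers

def reduce_pron(pron):
    return [reduced_map.get(p, p) for p in pron]

lexicon = {"light": ["L", "AY", "T"], "right": ["R", "AY", "T"]}
reduced_lexicon = {w: reduce_pron(p) for w, p in lexicon.items()}
print(reduced_lexicon)   # both words now share the pronunciation R AY T
```

In the multiple-pass scheme, one decoding pass would be run per reduced lexicon and the hypothesis with the best recognition score kept, so that each speaker is effectively matched to the reduction that best fits their proficiency.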

#25 Using resources from a closely-related language to develop ASR for a very under-resourced language: a case study for iban

Authors: Sarah Samson Juan ; Laurent Besacier ; Benjamin Lecouteux ; Mohamed Dyab

This paper presents our strategies for developing an automatic speech recognition system for Iban, an under-resourced language. We faced several challenges, such as the absence of a pronunciation dictionary and a lack of training material for building acoustic models. To overcome these problems, we propose approaches which exploit resources from a closely-related language (Malay). We developed a semi-supervised method for building the pronunciation dictionary and applied cross-lingual strategies for improving acoustic models trained with very limited training data. Both approaches yielded very encouraging results, showing that data from a closely-related language, if available, can be exploited to build ASR for a new language. In the final part of the paper, we present a zero-shot ASR system built from Malay resources that can be used as an alternative method for transcribing Iban speech.